ARC-AGI Evaluation Summary

This report summarizes the performance of five submissions evaluated against the 28_15x15_evaluation set.

Overall Results

Submission                      Total Score  # Tasks  Percentage  Mean Pixel Correct  Median Pixel Correct  Wilcoxon p-value (vs. 4o)
submission_4o.json              8.50         28       30.36%      71.51%              75.00%                Reference
submission_4omini.json          3.50         28       12.50%      61.47%              66.67%                0.1220
submission_agentswtool.json     7.50         28       26.79%      63.62%              62.50%                0.3890
submission_finetune4o.json      16.00        28       57.14%      86.88%              100.00%               0.0064
submission_finetune4omini.json  10.50        28       37.50%      73.62%              80.00%                0.7878
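
The report does not define its summary columns, so the following Python sketch is a reading of them rather than the evaluator's actual code. It assumes that a test pair counts as solved only at 100.00% pixel accuracy, that a task's score is the fraction of its pairs solved, and that the mean and median pixel figures are taken over all test pairs rather than per task. Under those assumptions the per-pair values listed for submission_4o.json reproduce its summary row. The names `PIXEL_PCT_4O` and `summarize` are hypothetical.

```python
from statistics import mean, median

# Pixel % per test pair for submission_4o.json, copied from the per-task
# tables in this report (one list per task, one value per pair).
PIXEL_PCT_4O = {
    "00576224": [100.00], "17cae0c1": [66.67], "2072aba6": [72.22],
    "27a77e38": [98.77], "31d5ba1a": [46.67, 40.00], "34b99a2b": [75.00],
    "4cd1b7b2": [100.00], "59341089": [0.00], "62b74c02": [0.00],
    "66e6c45b": [100.00], "66f2d22f": [64.29], "68b67ca3": [100.00],
    "6ea4a07e": [100.00, 100.00], "72207abc": [100.00], "8ba14f53": [88.89],
    "a8610ef7": [63.89], "aa18de87": [97.22], "b1fc8b8e": [0.00, 0.00],
    "bbb1b8b6": [100.00, 62.50], "be03b35f": [75.00], "ca8de6ea": [100.00],
    "d017b73f": [58.33], "e133d23d": [88.89], "e345f17b": [75.00, 50.00],
    "e633a9e5": [100.00], "ed74f2f2": [88.89], "ed98d772": [55.56],
    "fc754716": [92.06],
}


def summarize(pixel_pct_by_task):
    """Recompute the summary columns from per-pair pixel percentages."""
    # Assumption: a pair is solved only when every pixel matches (100.00%),
    # and a task scores the fraction of its pairs that are solved.
    task_scores = [
        sum(pct == 100.0 for pct in pairs) / len(pairs)
        for pairs in pixel_pct_by_task.values()
    ]
    # Assumption: pixel statistics pool every test pair, not one value per task.
    all_pairs = [pct for pairs in pixel_pct_by_task.values() for pct in pairs]
    total = sum(task_scores)
    return {
        "total_score": total,
        "num_tasks": len(task_scores),
        "percentage": round(100 * total / len(task_scores), 2),
        "mean_pixel": round(mean(all_pairs), 2),
        "median_pixel": round(median(all_pairs), 2),
    }


summary = summarize(PIXEL_PCT_4O)
print(summary)
```

The half point in the 8.50 total comes from task bbb1b8b6, where one of two pairs is solved. The Wilcoxon column is not covered here because the report does not say which paired values were tested.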

Submission: submission_4o.json

Task: 00576224

Pair Index  Correct Pixels  Pixel %
0           36              100.00%

Task: 17cae0c1

Pair Index  Correct Pixels  Pixel %
0           18              66.67%

Task: 2072aba6

Pair Index  Correct Pixels  Pixel %
0           26              72.22%

Task: 27a77e38

Pair Index  Correct Pixels  Pixel %
0           80              98.77%

Task: 31d5ba1a

Pair Index  Correct Pixels  Pixel %
0           7               46.67%
1           6               40.00%
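
Each row pairs a count of matching cells with that count as a share of the expected grid; for pair 0 of this task, 7 correct pixels at 46.67% is consistent with a 15-cell output. The Python sketch below shows one way such a row could be computed. The function name `pixel_accuracy` is hypothetical, and the handling of a prediction with the wrong shape (scored as zero) is an assumption, since the report does not describe that case.

```python
def pixel_accuracy(predicted, expected):
    """Return (correct_pixels, pixel_percent) for one test pair."""
    # Assumption: a prediction whose shape differs from the expected grid
    # scores zero; the report does not say how such cases are handled.
    same_shape = len(predicted) == len(expected) and all(
        len(p_row) == len(e_row) for p_row, e_row in zip(predicted, expected)
    )
    total = sum(len(row) for row in expected)
    if not same_shape or total == 0:
        return 0, 0.0
    # Count cells whose value matches at the same (row, column) position.
    correct = sum(
        p == e
        for p_row, e_row in zip(predicted, expected)
        for p, e in zip(p_row, e_row)
    )
    return correct, round(100 * correct / total, 2)


# Hypothetical 3x5 grids (not taken from any ARC task): 7 of 15 cells agree.
expected = [[1, 0, 1, 0, 1], [0, 1, 0, 1, 0], [1, 1, 1, 1, 1]]
predicted = [[1, 0, 1, 0, 1], [0, 1, 2, 2, 2], [2, 2, 2, 2, 2]]
print(pixel_accuracy(predicted, expected))  # (7, 46.67)
```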

Task: 34b99a2b

Pair Index  Correct Pixels  Pixel %
0           15              75.00%

Task: 4cd1b7b2

Pair Index  Correct Pixels  Pixel %
0           16              100.00%

Task: 59341089

Pair Index  Correct Pixels  Pixel %
0           0               0.00%

Task: 62b74c02

Pair Index  Correct Pixels  Pixel %
0           0               0.00%

Task: 66e6c45b

Pair Index  Correct Pixels  Pixel %
0           16              100.00%

Task: 66f2d22f

Pair Index  Correct Pixels  Pixel %
0           18              64.29%

Task: 68b67ca3

Pair Index  Correct Pixels  Pixel %
0           9               100.00%

Task: 6ea4a07e

Pair Index  Correct Pixels  Pixel %
0           9               100.00%
1           9               100.00%

Task: 72207abc

Pair Index  Correct Pixels  Pixel %
0           87              100.00%

Task: 8ba14f53

Pair Index  Correct Pixels  Pixel %
0           8               88.89%

Task: a8610ef7

Pair Index  Correct Pixels  Pixel %
0           23              63.89%

Task: aa18de87

Pair Index  Correct Pixels  Pixel %
0           35              97.22%

Task: b1fc8b8e

Pair Index  Correct Pixels  Pixel %
0           0               0.00%
1           0               0.00%

Task: bbb1b8b6

Pair Index  Correct Pixels  Pixel %
0           16              100.00%
1           10              62.50%

Task: be03b35f

Pair Index  Correct Pixels  Pixel %
0           3               75.00%

Task: ca8de6ea

Pair Index  Correct Pixels  Pixel %
0           9               100.00%

Task: d017b73f

Pair Index  Correct Pixels  Pixel %
0           14              58.33%

Task: e133d23d

Pair Index  Correct Pixels  Pixel %
0           8               88.89%

Task: e345f17b

Pair Index  Correct Pixels  Pixel %
0           12              75.00%
1           8               50.00%

Task: e633a9e5

Pair Index  Correct Pixels  Pixel %
0           25              100.00%

Task: ed74f2f2

Pair Index  Correct Pixels  Pixel %
0           8               88.89%

Task: ed98d772

Pair Index  Correct Pixels  Pixel %
0           20              55.56%

Task: fc754716

Pair Index  Correct Pixels  Pixel %
0           58              92.06%

Submission: submission_4omini.json

Task: 00576224

Pair Index  Correct Pixels  Pixel %
0           36              100.00%

Task: 17cae0c1

Pair Index  Correct Pixels  Pixel %
0           3               11.11%

Task: 2072aba6

Pair Index  Correct Pixels  Pixel %
0           28              77.78%

Task: 27a77e38

Pair Index  Correct Pixels  Pixel %
0           79              97.53%

Task: 31d5ba1a

Pair Index  Correct Pixels  Pixel %
0           8               53.33%
1           10              66.67%

Task: 34b99a2b

Pair Index  Correct Pixels  Pixel %
0           9               45.00%

Task: 4cd1b7b2

Pair Index  Correct Pixels  Pixel %
0           4               25.00%

Task: 59341089

Pair Index  Correct Pixels  Pixel %
0           18              50.00%

Task: 62b74c02

Pair Index  Correct Pixels  Pixel %
0           0               0.00%

Task: 66e6c45b

Pair Index  Correct Pixels  Pixel %
0           16              100.00%

Task: 66f2d22f

Pair Index  Correct Pixels  Pixel %
0           19              67.86%

Task: 68b67ca3

Pair Index  Correct Pixels  Pixel %
0           8               88.89%

Task: 6ea4a07e

Pair Index  Correct Pixels  Pixel %
0           0               0.00%
1           4               44.44%

Task: 72207abc

Pair Index  Correct Pixels  Pixel %
0           86              98.85%

Task: 8ba14f53

Pair Index  Correct Pixels  Pixel %
0           6               66.67%

Task: a8610ef7

Pair Index  Correct Pixels  Pixel %
0           24              66.67%

Task: aa18de87

Pair Index  Correct Pixels  Pixel %
0           34              94.44%

Task: b1fc8b8e

Pair Index  Correct Pixels  Pixel %
0           25              100.00%
1           18              72.00%

Task: bbb1b8b6

Pair Index  Correct Pixels  Pixel %
0           12              75.00%
1           9               56.25%

Task: be03b35f

Pair Index  Correct Pixels  Pixel %
0           2               50.00%

Task: ca8de6ea

Pair Index  Correct Pixels  Pixel %
0           9               100.00%

Task: d017b73f

Pair Index  Correct Pixels  Pixel %
0           0               0.00%

Task: e133d23d

Pair Index  Correct Pixels  Pixel %
0           6               66.67%

Task: e345f17b

Pair Index  Correct Pixels  Pixel %
0           14              87.50%
1           10              62.50%

Task: e633a9e5

Pair Index  Correct Pixels  Pixel %
0           17              68.00%

Task: ed74f2f2

Pair Index  Correct Pixels  Pixel %
0           1               11.11%

Task: ed98d772

Pair Index  Correct Pixels  Pixel %
0           24              66.67%

Task: fc754716

Pair Index  Correct Pixels  Pixel %
0           37              58.73%

Submission: submission_agentswtool.json

Task: 00576224

Pair Index  Correct Pixels  Pixel %
0           36              100.00%

Task: 17cae0c1

Pair Index  Correct Pixels  Pixel %
0           9               33.33%

Task: 2072aba6

Pair Index  Correct Pixels  Pixel %
0           36              100.00%

Task: 27a77e38

Pair Index  Correct Pixels  Pixel %
0           80              98.77%

Task: 31d5ba1a

Pair Index  Correct Pixels  Pixel %
0           9               60.00%
1           7               46.67%

Task: 34b99a2b

Pair Index  Correct Pixels  Pixel %
0           11              55.00%

Task: 4cd1b7b2

Pair Index  Correct Pixels  Pixel %
0           16              100.00%

Task: 59341089

Pair Index  Correct Pixels  Pixel %
0           28              77.78%

Task: 62b74c02

Pair Index  Correct Pixels  Pixel %
0           28              50.00%

Task: 66e6c45b

Pair Index  Correct Pixels  Pixel %
0           16              100.00%

Task: 66f2d22f

Pair Index  Correct Pixels  Pixel %
0           17              60.71%

Task: 68b67ca3

Pair Index  Correct Pixels  Pixel %
0           0               0.00%

Task: 6ea4a07e

Pair Index  Correct Pixels  Pixel %
0           9               100.00%
1           9               100.00%

Task: 72207abc

Pair Index  Correct Pixels  Pixel %
0           81              93.10%

Task: 8ba14f53

Pair Index  Correct Pixels  Pixel %
0           5               55.56%

Task: a8610ef7

Pair Index  Correct Pixels  Pixel %
0           23              63.89%

Task: aa18de87

Pair Index  Correct Pixels  Pixel %
0           36              100.00%

Task: b1fc8b8e

Pair Index  Correct Pixels  Pixel %
0           25              100.00%
1           21              84.00%

Task: bbb1b8b6

Pair Index  Correct Pixels  Pixel %
0           0               0.00%
1           11              68.75%

Task: be03b35f

Pair Index  Correct Pixels  Pixel %
0           1               25.00%

Task: ca8de6ea

Pair Index  Correct Pixels  Pixel %
0           1               11.11%

Task: d017b73f

Pair Index  Correct Pixels  Pixel %
0           16              66.67%

Task: e133d23d

Pair Index  Correct Pixels  Pixel %
0           5               55.56%

Task: e345f17b

Pair Index  Correct Pixels  Pixel %
0           2               12.50%
1           10              62.50%

Task: e633a9e5

Pair Index  Correct Pixels  Pixel %
0           13              52.00%

Task: ed74f2f2

Pair Index  Correct Pixels  Pixel %
0           1               11.11%

Task: ed98d772

Pair Index  Correct Pixels  Pixel %
0           20              55.56%

Task: fc754716

Pair Index  Correct Pixels  Pixel %
0           63              100.00%

Submission: submission_finetune4o.json

Task: 00576224

Pair Index  Correct Pixels  Pixel %
0           36              100.00%

Task: 17cae0c1

Pair Index  Correct Pixels  Pixel %
0           27              100.00%

Task: 2072aba6

Pair Index  Correct Pixels  Pixel %
0           36              100.00%

Task: 27a77e38

Pair Index  Correct Pixels  Pixel %
0           80              98.77%

Task: 31d5ba1a

Pair Index  Correct Pixels  Pixel %
0           10              66.67%
1           9               60.00%

Task: 34b99a2b

Pair Index  Correct Pixels  Pixel %
0           12              60.00%

Task: 4cd1b7b2

Pair Index  Correct Pixels  Pixel %
0           16              100.00%

Task: 59341089

Pair Index  Correct Pixels  Pixel %
0           36              100.00%

Task: 62b74c02

Pair Index  Correct Pixels  Pixel %
0           56              100.00%

Task: 66e6c45b

Pair Index  Correct Pixels  Pixel %
0           16              100.00%

Task: 66f2d22f

Pair Index  Correct Pixels  Pixel %
0           18              64.29%

Task: 68b67ca3

Pair Index  Correct Pixels  Pixel %
0           9               100.00%

Task: 6ea4a07e

Pair Index  Correct Pixels  Pixel %
0           9               100.00%
1           9               100.00%

Task: 72207abc

Pair Index  Correct Pixels  Pixel %
0           87              100.00%

Task: 8ba14f53

Pair Index  Correct Pixels  Pixel %
0           7               77.78%

Task: a8610ef7

Pair Index  Correct Pixels  Pixel %
0           27              75.00%

Task: aa18de87

Pair Index  Correct Pixels  Pixel %
0           36              100.00%

Task: b1fc8b8e

Pair Index  Correct Pixels  Pixel %
0           25              100.00%
1           21              84.00%

Task: bbb1b8b6

Pair Index  Correct Pixels  Pixel %
0           16              100.00%
1           10              62.50%

Task: be03b35f

Pair Index  Correct Pixels  Pixel %
0           2               50.00%

Task: ca8de6ea

Pair Index  Correct Pixels  Pixel %
0           9               100.00%

Task: d017b73f

Pair Index  Correct Pixels  Pixel %
0           17              70.83%

Task: e133d23d

Pair Index  Correct Pixels  Pixel %
0           7               77.78%

Task: e345f17b

Pair Index  Correct Pixels  Pixel %
0           14              87.50%
1           10              62.50%

Task: e633a9e5

Pair Index  Correct Pixels  Pixel %
0           25              100.00%

Task: ed74f2f2

Pair Index  Correct Pixels  Pixel %
0           9               100.00%

Task: ed98d772

Pair Index  Correct Pixels  Pixel %
0           25              69.44%

Task: fc754716

Pair Index  Correct Pixels  Pixel %
0           63              100.00%

Submission: submission_finetune4omini.json

Task: 00576224

Pair Index  Correct Pixels  Pixel %
0           36              100.00%

Task: 17cae0c1

Pair Index  Correct Pixels  Pixel %
0           27              100.00%

Task: 2072aba6

Pair Index  Correct Pixels  Pixel %
0           10              27.78%

Task: 27a77e38

Pair Index  Correct Pixels  Pixel %
0           79              97.53%

Task: 31d5ba1a

Pair Index  Correct Pixels  Pixel %
0           10              66.67%
1           12              80.00%

Task: 34b99a2b

Pair Index  Correct Pixels  Pixel %
0           11              55.00%

Task: 4cd1b7b2

Pair Index  Correct Pixels  Pixel %
0           4               25.00%

Task: 59341089

Pair Index  Correct Pixels  Pixel %
0           16              44.44%

Task: 62b74c02

Pair Index  Correct Pixels  Pixel %
0           44              78.57%

Task: 66e6c45b

Pair Index  Correct Pixels  Pixel %
0           16              100.00%

Task: 66f2d22f

Pair Index  Correct Pixels  Pixel %
0           9               32.14%

Task: 68b67ca3

Pair Index  Correct Pixels  Pixel %
0           9               100.00%

Task: 6ea4a07e

Pair Index  Correct Pixels  Pixel %
0           2               22.22%
1           4               44.44%

Task: 72207abc

Pair Index  Correct Pixels  Pixel %
0           78              89.66%

Task: 8ba14f53

Pair Index  Correct Pixels  Pixel %
0           6               66.67%

Task: a8610ef7

Pair Index  Correct Pixels  Pixel %
0           24              66.67%

Task: aa18de87

Pair Index  Correct Pixels  Pixel %
0           36              100.00%

Task: b1fc8b8e

Pair Index  Correct Pixels  Pixel %
0           25              100.00%
1           25              100.00%

Task: bbb1b8b6

Pair Index  Correct Pixels  Pixel %
0           16              100.00%
1           10              62.50%

Task: be03b35f

Pair Index  Correct Pixels  Pixel %
0           4               100.00%

Task: ca8de6ea

Pair Index  Correct Pixels  Pixel %
0           9               100.00%

Task: d017b73f

Pair Index  Correct Pixels  Pixel %
0           0               0.00%

Task: e133d23d

Pair Index  Correct Pixels  Pixel %
0           8               88.89%

Task: e345f17b

Pair Index  Correct Pixels  Pixel %
0           14              87.50%
1           11              68.75%

Task: e633a9e5

Pair Index  Correct Pixels  Pixel %
0           25              100.00%

Task: ed74f2f2

Pair Index  Correct Pixels  Pixel %
0           6               66.67%

Task: ed98d772

Pair Index  Correct Pixels  Pixel %
0           21              58.33%

Task: fc754716

Pair Index  Correct Pixels  Pixel %
0           63              100.00%